First things first, let's get some terminology straight.
The language we're working in – Python 3.7
The editor we're using is Google Colab – the code runs on Google's servers, and the results show up in our browser
The specific notebook we're looking at now is an interactive Python notebook, a .ipynb file. These are pretty special, also known as Jupyter notebooks.
Jupyter notebooks have a few special properties that make them ideal for working with data:
x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'
print(x) # Run this cell after running the one above, and again after running the one below
x = 42
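Here is a minimal sketch of why run order matters, written as one script rather than three separate cells. The name x always holds whatever was assigned most recently, and that is what print(x) reports. The two seen_after_* names are made up for this illustration.

```python
# "Cell 1": x starts out as a string
x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'
seen_after_cell_1 = x  # what print(x) would show right after cell 1

# "Cell 3": x is reassigned to a number
x = 42
seen_after_cell_3 = x  # what print(x) would show right after cell 3

print(seen_after_cell_1)
print(seen_after_cell_3)
```

In a notebook, the same thing happens across cells: the value you see depends on which cell ran last, not on where the cell sits on the page.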
We use the pandas package to easily work with data as tables.
The numpy package allows us to work with some other special data types, like missing values
We'll rename these as pd and np, just so it's easier to refer to them later on
# 'as' lets us give each package a shorter alias
import pandas as pd
import numpy as np
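As a small sketch of what "missing values" means here: numpy provides np.nan, and pandas knows how to detect and skip it. The heights Series and its numbers are invented for this example.

```python
import numpy as np
import pandas as pd

# A made-up column of measurements with one missing entry
heights = pd.Series([12.0, np.nan, 30.5])

# .isna() marks each missing entry as True
missing_mask = heights.isna()
n_missing = int(missing_mask.sum())

# .mean() skips missing values by default
mean_height = heights.mean()

print(n_missing)
print(mean_height)
```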
For this semester, we'll typically work with data in tabular format, the type you'd be used to seeing in an Excel spreadsheet. Data files saved in this format will usually have a .csv file extension, short for comma-separated values.
To import this, let's use the pd.read_csv() function:
# Replace w/ URL
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-1/trees.csv'
trees = pd.read_csv(url)
Here, we've saved the data to a dataframe object named trees
trees.shape
type(trees)
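To see what pd.read_csv() actually does with comma-separated text, here is a small sketch that reads a CSV from a string instead of a URL. The two rows and the toy_trees name are made up; only the column names echo the real dataset.

```python
import io
import pandas as pd

# The first line holds the column names, each later line is one row
csv_text = """common_name,dbh
Cherry Plum,4
Coast Live Oak,12
"""

# io.StringIO lets read_csv treat the string like a file
toy_trees = pd.read_csv(io.StringIO(csv_text))

print(toy_trees.shape)   # (rows, columns)
print(type(toy_trees))
```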
Let's take a look at the data. We'll use the methods .head(), .tail(), and .sample()
trees.head()
trees.head(10) # Show the first 10 rows
trees.sample(3) # Choose 3 rows at random
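The .tail() method isn't used in the cells above, so here is a quick sketch on a tiny made-up dataframe (the toy name and its values are assumptions for illustration). It works like .head(), but counts from the end.

```python
import pandas as pd

# A made-up table with five rows
toy = pd.DataFrame({
    'common_name': ['Cherry Plum', 'Coast Live Oak', 'Ginkgo', 'Red Maple', 'Olive'],
    'dbh': [4, 12, 7, 9, 5],
})

# Last two rows; with no argument, .tail() returns the last five
last_two = toy.tail(2)
print(last_two)
```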
How big is the dataset? .shape returns a tuple with the dimensions as (rows, columns)
trees.shape
Let's take a look at some of the values in the dataset.
- How many unique tree species are there in the dataset?
- What are the different caretaker types?
trees.species_name.nunique()
trees.caretaker.unique()
Which tree shows up the most frequently?
trees.common_name.value_counts()
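To make the difference between these three methods concrete, here is a sketch on a three-row made-up table (the toy values are invented). .nunique() returns a count, .unique() returns the distinct values, and .value_counts() returns how often each value appears, most frequent first.

```python
import pandas as pd

toy = pd.DataFrame({
    'common_name': ['Cherry Plum', 'Cherry Plum', 'Coast Live Oak'],
    'caretaker': ['Private', 'DPW', 'Private'],
})

n_names = toy.common_name.nunique()      # how many distinct names
caretakers = toy.caretaker.unique()      # the distinct values themselves
counts = toy.common_name.value_counts()  # frequency of each name
most_common = counts.index[0]            # first entry is the most frequent

print(n_names)
print(caretakers)
print(counts)
```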
What are the biggest trees?
Note: DBH stands for diameter at breast height, a standard measure of a tree's trunk diameter
trees.sort_values(by='dbh', ascending=False).head()
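As a sketch of what the sort is doing, here it is on a made-up table, next to pandas' .nlargest() method, which is an alternative way to get the same top rows (it is not what the cell above uses). The toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'common_name': ['Cherry Plum', 'Coast Live Oak', 'Ginkgo', 'Olive'],
    'dbh': [4, 30, 12, 7],
})

# Sort from largest to smallest dbh, then keep the top two rows
by_sort = toy.sort_values(by='dbh', ascending=False).head(2)

# Alternative: ask directly for the two rows with the largest dbh
by_nlargest = toy.nlargest(2, 'dbh')

print(by_sort)
print(by_nlargest)
```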
Subsetting is a super helpful tool. We'll take a look at this in more depth next week, but for now, here are the basics:
Let's take a look at just the species name, common name, and address. We can place these column names into a list, then subset the original dataframe by that list
cols = ['species_name', 'common_name', 'address']
trees_subset = trees[cols]
# Same thing as trees[['species_name', 'common_name', 'address']]
trees_subset.head()
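One detail worth a quick sketch: selecting with a single column name gives back a Series, while selecting with a list of names (even a list of one) gives back a dataframe. The toy table below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'species_name': ['Prunus cerasifera', 'Quercus agrifolia', 'Ginkgo biloba'],
    'common_name': ['Cherry Plum', 'Coast Live Oak', 'Ginkgo'],
    'dbh': [4, 12, 7],
})

one_column = toy['common_name']          # single name -> Series
one_column_frame = toy[['common_name']]  # list of names -> DataFrame

print(type(one_column))
print(type(one_column_frame))
```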
We can filter rows from a dataframe based on some condition:
- Show only trees north of Golden Gate Park (latitude > 37.77285)
- How about only trees in Front, Back, and Side Yards?
- Show only Cherry Plum trees
trees[trees.latitude > 37.77285]
trees[trees.site_location.isin(['Front Yard','Side Yard','Back Yard'])]
trees[trees.common_name == 'Cherry Plum']
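Conditions like these can also be combined. The sketch below, on a made-up three-row table, uses & for "and" and | for "or"; each condition needs its own parentheses. The toy values and variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'common_name': ['Cherry Plum', 'Cherry Plum', 'Coast Live Oak'],
    'latitude': [37.78, 37.76, 37.79],
})

is_north = toy.latitude > 37.77285           # True/False for every row
is_plum = toy.common_name == 'Cherry Plum'

north_plums = toy[is_north & is_plum]        # both conditions hold
north_or_plum = toy[is_north | is_plum]      # at least one holds

print(north_plums)
print(north_or_plum)
```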
First things first, let's import the package to help us visualize the data, plotly.
If this package isn't installed yet, we can install it using !pip install plotly. More on this in week 5.
import plotly.express as px
## Run the following if graphs don't show
# import plotly.io as pio
# pio.renderers.default='notebook'
Note that we're using a subpackage of the broader plotly package, called plotly express. It simplifies a lot of the more difficult steps
Plotly express has a broad range of options to play with, so let's take a look at the documentation.
Do a quick Google search to pull up the documentation for px.scatter, or run px.scatter? in a Jupyter cell
px.scatter?
fig = px.scatter(trees.sample(frac=.1), x='date', y='dbh')
fig.show('notebook')
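The plot above draws only a random tenth of the rows, via .sample(frac=.1), which keeps the figure responsive. Here is a sketch of how frac works on a made-up 100-row table. The random_state argument is an addition not used in the cell above; it makes the same rows come back on every run.

```python
import pandas as pd

# A made-up table with 100 rows
toy = pd.DataFrame({'dbh': range(100)})

# frac=.1 keeps 10% of the rows, chosen at random
tenth = toy.sample(frac=.1, random_state=0)

# Same random_state, same rows
tenth_again = toy.sample(frac=.1, random_state=0)

print(len(tenth))
print(tenth.index.equals(tenth_again.index))
```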
Clearly, there aren't any obvious trends in this view. Let's add in some more parameters
trees_sample = trees.sample(frac=.2)
fig = px.scatter(trees_sample, x='date', y='dbh',
                 opacity=.15, color='site_location',
                 hover_name='common_name', hover_data=['site_location', 'site_type', 'address'],
                 marginal_x='histogram', marginal_y='histogram',
                 color_discrete_sequence=px.colors.qualitative.Prism[4:]
                 )
fig.show('notebook')
The transportation department wants to track any trees sitting on a road median, in order to quickly remove debris after a bad storm.
Is there a general area in which there are more roadside / median trees?
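# Note: 'stamen-terrain' tiles may no longer load without a Stadia Maps account;
# 'open-street-map' is a token-free alternative if the map comes up blank
fig = px.scatter_mapbox(trees_sample, lat='latitude', lon='longitude',
                        color='site_location', size='dbh', opacity=.4,
                        color_discrete_sequence=px.colors.qualitative.Prism[4:],
                        hover_name='address', hover_data=['common_name', 'site_location', 'caretaker'],
                        zoom=11, mapbox_style="stamen-terrain",
                        )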
fig.show('notebook')
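Alongside the map, the same question can be approached with the filtering tools from earlier. The sketch below uses a made-up table and assumes median trees are labeled 'Median' in site_location, which may not match the labels in the real dataset. It keeps only those rows and averages their coordinates for a rough center point.

```python
import pandas as pd

toy = pd.DataFrame({
    'site_location': ['Sidewalk', 'Median', 'Median', 'Front Yard'],
    'latitude': [37.76, 37.78, 37.80, 37.75],
    'longitude': [-122.43, -122.41, -122.42, -122.45],
})

# Keep only the rows labeled as median trees (assumed label)
median_trees = toy[toy.site_location == 'Median']

# Average position of those trees, as a rough "general area"
center = median_trees[['latitude', 'longitude']].mean()

print(len(median_trees))
print(center)
```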